Life expectancy is a crucial indicator of the overall health of a population. Immunizations, such as DTP and measles vaccines, are believed to contribute to public health improvements and reduce mortality from these preventable diseases. In our project, we will explore the relationship between immunization and life expectancy across different countries, using the 2020 United Nations Human Development Data.
The key question is: Do higher vaccination rates correlate with higher life expectancy across countries?
The first row of the file is skipped when reading the data because the true column headers are in row 2.
library(tidyverse) # attaches dplyr, ggplot2, and readr, which the code below relies on
life_data <- read.csv("dataset/UNHDD 2020.csv", skip = 1)
head(life_data)
In the dataset, missing values are represented as “..”. We will replace these with NA. We will also save the cleaned dataset to dataset/life_data.csv for further use.
# Replace ".." with NA
life_data[life_data == ".."] <- NA
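# Convert character columns to numeric. Note that this also coerces the
# text column `country` to NA, which triggers the warning below.
life_data <- life_data %>% mutate_if(is.character, as.numeric)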
## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `country = .Primitive("as.double")(country)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
# View the first few rows to verify
head(life_data)
# Write file
write_csv(life_data, "dataset/life_data.csv")
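The coercion warning above arises because the text column `country` is passed through `as.numeric` along with the value columns. As a minimal sketch of one way to avoid this, the conversion can be limited to the non-identifier columns. The toy data frame and its values below are hypothetical stand-ins for the raw file, and base R is used so the example runs without extra packages.

```r
# Toy stand-in for the raw data; all values are hypothetical.
raw <- data.frame(
  country = c("A", "B", "C"),
  life = c("82.4", "..", "79.1"),
  dtp_vax_no = c("1", "2", ".."),
  stringsAsFactors = FALSE
)

# Treat ".." as missing.
raw[raw == ".."] <- NA

# Convert every column except the identifier column to numeric.
value_cols <- setdiff(names(raw), "country")
raw[value_cols] <- lapply(raw[value_cols], as.numeric)

str(raw)
```

With this approach `country` stays a character column, so later tables can still identify each row.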
We selected key variables of interest for further analysis: life expectancy and vaccination rates (DTP and measles).
selected_data <- life_data %>%
select(country, life, dtp_vax_no, measles_vax_no)
head(selected_data)
## country life dtp_vax_no measles_vax_no
## 1 NA 82.40 1 3
## 2 NA 82.31 2 9
## 3 NA 83.78 2 5
## 4 NA 84.86 NA NA
## 5 NA 82.99 3 7
## 6 NA 81.33 2 3
We will visualize the distribution of vaccination rates and their relationship with life expectancy.
# Scatter plot for DTP vaccinations vs life expectancy
scatter_plot <- ggplot(life_data, aes(x = dtp_vax_no, y = life)) +
geom_point(color = "black", fill = "blue", shape = 21, size = 3) +
labs(
x = "DTP Vaccinations",
y = "Life Expectancy"
) +
ggtitle("Relationship between DTP Vaccinations and Life Expectancy")
print(scatter_plot)
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
# Scatter plot for Measles vaccinations vs life expectancy
scatter_plot2 <- ggplot(life_data, aes(x = measles_vax_no, y = life)) +
geom_point(color = "black", fill = "blue", shape = 21, size = 3) +
labs(
x = "Measles Vaccinations",
y = "Life Expectancy"
) +
ggtitle("Relationship between Measles Vaccinations and Life Expectancy")
print(scatter_plot2)
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
# Histogram for the DTP Vaccine
## Convert vaccination numbers to numeric to generate histograms
life_data$dtp_vax_no <- as.numeric(life_data$dtp_vax_no)
## Check for any conversion warnings or issues
summary(life_data$dtp_vax_no)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 1.000 3.000 7.316 8.500 56.000 2
## Now we create the histogram for DTP Vaccinations
h1 <- hist(life_data$dtp_vax_no,
main = "DTP Vaccine",
xlab = "DTP Vaccine Number",
col = "blue",
border = "black")
h1_log <- hist(log(life_data$dtp_vax_no),
main = "Log-Transformed Distribution of DTP Vaccinations",
xlab = "log(DTP Vaccine Number)",
col = "blue",
border = "black")
The distribution of the DTP vaccination variable is right-skewed. Log transformation reduces the skew enough for the variable to be used in linear regression.
## Print the histogram object to see details
print(h1)
# Histogram for the Measles Vaccine
## Convert vaccination numbers to numeric to generate histograms
life_data$measles_vax_no <- as.numeric(life_data$measles_vax_no)
## Check for any conversion warnings or issues
summary(life_data$measles_vax_no)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.0 3.0 7.0 12.4 16.0 63.0 2
## Now we create the histogram for measles vaccines
h2 <- hist(life_data$measles_vax_no,
main = "Measles Vaccine",
xlab = "Measles Vaccine Number",
col = "blue",
border = "black")
h2_log <- hist(log(life_data$measles_vax_no),
main = "Log-Transformed Distribution of Measles Vaccinations",
xlab = "log(Measles Vaccine Number)",
col = "blue",
border = "black")
The distribution of the measles vaccination variable is similarly right-skewed. Log transformation reduces the skew enough for the variable to be used in linear regression.
## Print the histogram object to see details
print(h2)
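The effect of the log transformation on skew can also be checked numerically rather than only by eye. The sketch below is illustrative: `skewness` is a helper defined here (it is not part of base R), and `toy_counts` is simulated right-skewed data, not the vaccination columns themselves.

```r
# Sample skewness (third standardized moment); helper defined for illustration.
skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

set.seed(42)
toy_counts <- rlnorm(200, meanlog = 1, sdlog = 1)  # hypothetical right-skewed data

skew_raw <- skewness(toy_counts)
skew_log <- skewness(log(toy_counts))

# A value near zero indicates a roughly symmetric distribution.
c(raw = skew_raw, log = skew_log)
```

Applying the same helper to `life_data$dtp_vax_no` and `life_data$measles_vax_no` would give a numeric companion to the histograms.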
Before conducting a multiple linear regression, we will explore the relationship between each vaccination variable and life expectancy using simple linear regression. This will help us understand how each vaccination variable relates to life expectancy on its own.
model_dtp <- lm(life ~ log(dtp_vax_no), data = life_data)
# Summary of the regression results
summary(model_dtp)
##
## Call:
## lm(formula = life ~ log(dtp_vax_no), data = life_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9865 -3.9050 0.0387 5.0733 12.3582
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 77.1717 0.6906 111.745 < 2e-16 ***
## log(dtp_vax_no) -3.4699 0.3960 -8.762 1.22e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.194 on 185 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.2933, Adjusted R-squared: 0.2894
## F-statistic: 76.77 on 1 and 185 DF, p-value: 1.216e-15
# Confidence intervals
confint(model_dtp)
## 2.5 % 97.5 %
## (Intercept) 75.809180 78.534137
## log(dtp_vax_no) -4.251273 -2.688621
plot(model_dtp)
Based on the diagnostic plots above, log transformation of the explanatory variable allows the model to reasonably meet the assumptions of linear regression.
model_measles <- lm(life ~ log(measles_vax_no), data = life_data)
# Summary of the regression results
summary(model_measles)
##
## Call:
## lm(formula = life ~ log(measles_vax_no), data = life_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.8365 -4.7789 -0.4665 4.7904 11.2962
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 78.9344 0.8955 88.149 < 2e-16 ***
## log(measles_vax_no) -3.2879 0.3988 -8.244 2.99e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.301 on 185 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.2687, Adjusted R-squared: 0.2647
## F-statistic: 67.97 on 1 and 185 DF, p-value: 2.994e-14
# Confidence intervals
confint(model_measles)
## 2.5 % 97.5 %
## (Intercept) 77.167767 80.701041
## log(measles_vax_no) -4.074658 -2.501105
plot(model_measles)
Based on the diagnostic plots above, log transformation of the explanatory variable allows the model to reasonably meet the assumptions of linear regression.
DTP Vaccinations: The simple linear regression model for DTP vaccinations has a coefficient of -3.47 with a p-value of 1.22e-15. Because the predictor is log-transformed, a one-unit increase in log(dtp_vax_no) is associated with a 3.47-year decrease in life expectancy; equivalently, a 1% increase in dtp_vax_no is associated with a decrease of about 0.035 years. The p-value indicates that this relationship is statistically significant.
Measles Vaccinations: The simple linear regression model for measles vaccinations has a coefficient of -3.29 with a highly significant p-value of 2.99e-14. A one-unit increase in log(measles_vax_no) is associated with a 3.29-year decrease in life expectancy; equivalently, a 1% increase in measles_vax_no is associated with a decrease of about 0.033 years. This negative relationship is statistically significant, as indicated by the p-value.
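For reference, the link between a coefficient on a log-transformed predictor and a percentage change in that predictor can be written out as follows. This is the standard level-log interpretation, with $x$ standing for either vaccination variable and the rounded coefficients taken from the two summaries above.

```latex
\begin{aligned}
\hat{y} &= \hat{\beta}_0 + \hat{\beta}_1 \ln(x) \\
\Delta\hat{y} &= \hat{\beta}_1\left[\ln(1.01\,x) - \ln(x)\right]
              = \hat{\beta}_1 \ln(1.01) \approx \frac{\hat{\beta}_1}{100} \\
\text{DTP:}\quad & -3.47 \times \ln(1.01) \approx -0.035 \text{ years per 1\% increase in } x \\
\text{Measles:}\quad & -3.29 \times \ln(1.01) \approx -0.033 \text{ years per 1\% increase in } x
\end{aligned}
```

A full one-unit increase in $\ln(x)$ corresponds to multiplying $x$ by $e \approx 2.72$, and is associated with a change of $\hat{\beta}_1$ years.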
We aim to test the following hypotheses:
Null Hypothesis (H0): Vaccination rates (DTP and Measles) have no effect on life expectancy.
Alternative Hypothesis (H1): Vaccination rates (DTP and Measles) have a significant effect on life expectancy.
We will use multiple linear regression to test these hypotheses.
# Linear regression model for DTP and Measles vaccinations
model_vaccines <- lm(life ~ log(dtp_vax_no) + log(measles_vax_no), data = life_data)
# Summary of regression results
summary(model_vaccines)
##
## Call:
## lm(formula = life ~ log(dtp_vax_no) + log(measles_vax_no), data = life_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.2119 -4.3699 -0.1861 5.1011 11.4672
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 78.4765 0.8796 89.217 < 2e-16 ***
## log(dtp_vax_no) -2.2557 0.6481 -3.481 0.000625 ***
## log(measles_vax_no) -1.5079 0.6415 -2.351 0.019807 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.12 on 184 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.3139, Adjusted R-squared: 0.3064
## F-statistic: 42.09 on 2 and 184 DF, p-value: 8.902e-16
# Confidence intervals
confint(model_vaccines)
## 2.5 % 97.5 %
## (Intercept) 76.741072 80.2119215
## log(dtp_vax_no) -3.534243 -0.9770873
## log(measles_vax_no) -2.773581 -0.2422205
plot(model_vaccines)
Similar to the previous plots for the two explanatory variables in isolation, the multiple regression model also meets the assumptions required for valid linear regression.
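The overall F-test in the summary is the test that corresponds to the hypotheses stated earlier: under the null, both slope coefficients are zero. As a small sketch, the overall p-value can be recomputed from the F statistic and degrees of freedom reported in the output; the variable names here are chosen for illustration only.

```r
# Values copied from the summary(model_vaccines) output above.
f_stat <- 42.09
df_model <- 2     # number of slope coefficients being tested
df_resid <- 184   # residual degrees of freedom

# The upper-tail probability of the F distribution gives the overall p-value.
p_value <- pf(f_stat, df_model, df_resid, lower.tail = FALSE)
p_value
```

The result should agree with the p-value printed on the F-statistic line of the summary, up to rounding of the inputs.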
We conducted a multiple linear regression analysis to examine the relationship between life expectancy and two predictor variables: DTP vaccinations and measles vaccinations. Below is the interpretation of the results:
The intercept is 78.48, meaning that when log(dtp_vax_no) and log(measles_vax_no) are both zero (that is, when both variables equal 1), the predicted life expectancy is 78.48 years. Because the predictors are log-transformed, the intercept does not describe a country with zero vaccinations, since log(0) is undefined.
The coefficient for DTP vaccinations is -2.26, indicating a negative relationship between DTP vaccinations and life expectancy after adjusting for measles vaccinations. With a p-value of 0.0006, there is strong evidence that this association is not zero in this dataset.
Based on the model, a 1% increase in dtp_vax_no is associated with a decrease in life expectancy of roughly 0.0226 years (the coefficient divided by 100), holding measles_vax_no constant.
The coefficient for measles vaccinations is -1.51, indicating a negative relationship between measles vaccinations and life expectancy, with a p-value of 0.0198. This is statistically significant at the 0.05 level, meaning there is evidence that measles vaccinations are associated with a decrease in life expectancy, which is counterintuitive.
Based on the model, a 1% increase in measles_vax_no is associated with a decrease in life expectancy of about 0.015 years, holding dtp_vax_no constant. This is an unexpected result, as we would generally anticipate a positive or neutral relationship between vaccinations and life expectancy. It could reflect confounding by other variables, or specific conditions in some countries where high values of this variable coincide with other negative factors.
The 95% confidence interval for the DTP coefficient is [-3.53, -0.98], which does not include zero. This confirms that the effect of DTP vaccinations is statistically significant, as zero is not a plausible value for the coefficient.
The confidence interval for measles vaccinations is [-2.77, -0.24], which does not include zero, reinforcing the statistically significant negative association between measles vaccinations and life expectancy.
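These intervals follow the usual form of estimate plus or minus a t quantile times the standard error. As a quick sketch, the DTP interval can be reproduced from the coefficient table; the variable names are illustrative and the inputs are the rounded values printed in the summary.

```r
# Values copied from the summary(model_vaccines) coefficient table.
estimate <- -2.2557   # coefficient for log(dtp_vax_no)
std_error <- 0.6481
df_resid <- 184       # residual degrees of freedom

# 95% interval: estimate +/- t quantile times the standard error.
t_crit <- qt(0.975, df_resid)
ci_dtp <- estimate + c(-1, 1) * t_crit * std_error
round(ci_dtp, 2)
```

The same calculation with the measles estimate and standard error should recover the second row of the `confint` output.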
DTP vaccinations show a statistically significant negative association with life expectancy in this dataset. Measles vaccinations also show a statistically significant negative association with life expectancy. Both results are unexpected and should be explored further to understand potential confounding factors or data quality issues. One possibility worth checking first is how the variables are coded: the names dtp_vax_no and measles_vax_no and their low typical values (medians of 3 and 7) would be consistent with a measure of infants lacking immunization rather than vaccination coverage, in which case a negative association would be the expected direction. Other avenues include exploring interactions with other variables or including additional health and socio-economic indicators in the model.
The earlier regressions showed that the vaccination variables alone explain only part of the variation in life expectancy across countries (an R-squared of about 0.31 for the two-predictor model). The logical next step is to explore additional variables that may serve as better predictors of life expectancy, with the aim of producing a more conclusive statistical model. Multiple linear regression was chosen for this analysis because it can accommodate multiple explanatory variables of various data types.
| Variable | Rationale for Inclusion |
| --- | --- |
| Total Population | Variance in population numbers could have an impact on life expectancy via a variety of economic and social factors. |
| Total Fertility Rate | The number of children a woman has could acutely affect individual life expectancy and could also serve as a signal for wider societal dysfunction. |
| Infant Mortality | Infant mortality likely significantly impacts life expectancy measures as naturally the associated low age of death would substantially affect the average. |
| Urban Population | Variation in urban population percentages could contribute to variability in life expectancy through disease rates, crime prevalence, and overall economic opportunity. |
| Prison Population | A higher prison population could be indicative of increased violent crime, and prisons can also serve as a vector to spread disease; both could have a significant impact on life expectancy. |
| Tuberculosis Incidence | Tuberculosis is a somewhat preventable but serious communicable disease that could serve as a proxy measure of overall healthcare infrastructure. It also has the potential to influence life expectancy in a number of downstream associations due to infection complications. |
| Median Age | Median age is likely to be a confounding variable for many of the explanatory variables included in this analysis; as such it has been incorporated to add depth and validity to drawn conclusions. |
| Mean Years of Schooling | Highly educated individuals are likely to be more health conscious; thus, mean years of schooling has been included in the analysis to evaluate the statistical strength of the relationship between it and life expectancy. |
| Depth of Food Deficit | If a country has less food, its people are more likely to be malnourished, increasing the risk of disease and death and likely contributing to reduced life expectancy. |
| Total GDP | This variable represents the economic strength of individual countries; in this case the hypothesized relationship is that economic strength facilitates better healthcare and lifestyles for citizens, potentially improving life expectancy. |
| Migration | Higher rates of migration could serve as an additional vector for communicable disease; on the other hand high migration rates could indicate a desirable country to live in economically. |
##install.packages("corrplot")
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.3
## corrplot 0.94 loaded
## preprocessing
country_data <- read.csv("dataset/UNHDD 2020.csv", skip = 1)
country_data[country_data[, 1:70] == '..'] <- NA
NA_counts <- data.frame(sapply(country_data, function(x) sum(is.na(x))))
Independence of observations is assumed because each row of the dataset represents a different country.
## parsing dataset for only relevant variables
country_data_regression <- country_data %>%
select('pop', 'fertility', 'mort_infant', 'urban', 'tb', 'age_median', 'educ_mean', 'food_deficit', 'gdp_total', 'prison', 'migration', 'life') %>%
mutate_if(is.character, as.numeric)
## corrplot analysis
corrplot(cor(country_data_regression, use="pairwise.complete.obs"), method = 'number',
tl.cex = 1, # Increase text label size
tl.srt = 45, # Rotate text labels by 45 degrees for better readability
number.cex = 0.7 # number size in the table
)
Some variables are not correlated at all with life expectancy, or are highly correlated (absolute correlation above 0.8) with other explanatory variables, so they should be removed to improve the interpretability of the results.
## remove highly correlated variables (> 0.8)
country_data_regression <- country_data %>%
select('urban', 'tb', 'fertility', 'educ_mean', 'food_deficit', 'gdp_total', 'prison', 'migration', 'life') %>%
mutate_if(is.character, as.numeric)
##corrplot analysis
corrplot(cor(country_data_regression, use="pairwise.complete.obs"), method = 'number',
tl.cex = 1, # Increase text label size
tl.srt = 45, # Rotate text labels by 45 degrees for better readability
number.cex = 0.7 # number size in the table
)
After removing irrelevant and highly correlated variables (pop, mort_infant, age_median), multicollinearity is reduced and all remaining explanatory variables have some form of linear relationship with life expectancy.
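The screening step above was done by reading the correlation plot. As a complementary sketch, pairs of predictors above the 0.8 threshold can be listed programmatically from the correlation matrix. The data frame `toy` and its columns are simulated for illustration and are not taken from the dataset.

```r
set.seed(1)
n <- 50
x1 <- rnorm(n)

# Hypothetical predictors: x2 is nearly a copy of x1, x3 is unrelated.
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.1),
  x3 = rnorm(n)
)

cm <- cor(toy, use = "pairwise.complete.obs")

# Keep the upper triangle only so each pair is reported once.
high <- which(abs(cm) > 0.8 & upper.tri(cm), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r = cm[high]
)
high_pairs
```

Running the same steps on `country_data_regression` would list the pairs that motivated dropping variables, which makes the selection easier to reproduce.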
Fitting the regression model for use in evaluation of assumptions:
## fit regression model
regression_model <- lm(life ~
+ urban + fertility + tb + prison + migration +
+ educ_mean + food_deficit + gdp_total, country_data_regression)
## plot residuals via scatter plot
plot(regression_model, 1)
No distinct non-linear pattern or funnel shape can be discerned in the above scatter plot of residuals versus fitted values, suggesting that the linearity assumption holds and that the variance of the residuals is roughly constant.
## qqline plot to assess normality
qqnorm(resid(regression_model))
qqline(resid(regression_model))
Based on the Q-Q plot, the residuals appear to be approximately normally distributed, as they closely follow the reference line for the most part.
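A formal test can complement the visual check of normality. The sketch below applies the Shapiro-Wilk test (`shapiro.test` from base R's stats package) to the residuals of a model fitted on simulated data; `toy` and `toy_model` are hypothetical and only demonstrate the call pattern.

```r
set.seed(7)

# Hypothetical data with normally distributed errors.
toy <- data.frame(x = runif(100, 0, 10))
toy$y <- 70 + 1.5 * toy$x + rnorm(100, sd = 3)

toy_model <- lm(y ~ x, data = toy)

# Shapiro-Wilk test: a small p-value would suggest non-normal residuals.
sw <- shapiro.test(resid(toy_model))
sw$p.value
```

The same call on `resid(regression_model)` would give a numeric check to read alongside the Q-Q plot; with larger samples the test can flag deviations that are too small to matter in practice, so the plot remains the primary guide.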
Null - None of the selected predictor variables contributes to life expectancy (all slope coefficients are zero).
Alternative - At least one of the selected predictor variables contributes to life expectancy (at least one slope coefficient is nonzero).
All assumptions are effectively satisfied so we can move forward with summarizing and interpreting the results of the multiple linear regression model.
## Return regression results
summary(regression_model)
##
## Call:
## lm(formula = life ~ +urban + fertility + tb + prison + migration +
## +educ_mean + food_deficit + gdp_total, data = country_data_regression)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9634 -1.9587 -0.0238 2.1353 9.7838
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.143e+01 2.984e+00 23.941 < 2e-16 ***
## urban 2.740e-02 1.495e-02 1.833 0.06873 .
## fertility -2.938e+00 3.291e-01 -8.926 1.19e-15 ***
## tb -1.277e-02 2.146e-03 -5.948 1.74e-08 ***
## prison -3.184e-03 2.175e-03 -1.464 0.14518
## migration 2.029e-01 6.497e-02 3.124 0.00213 **
## educ_mean 3.611e-01 1.380e-01 2.616 0.00977 **
## food_deficit 5.313e-02 2.023e-02 2.626 0.00950 **
## gdp_total -7.591e-07 9.780e-05 -0.008 0.99382
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.081 on 155 degrees of freedom
## (25 observations deleted due to missingness)
## Multiple R-squared: 0.8367, Adjusted R-squared: 0.8283
## F-statistic: 99.28 on 8 and 155 DF, p-value: < 2.2e-16
Overall, the multiple regression model with the selected variables fits this data well, returning an R-squared of 0.8367 and an overall p-value below 2.2e-16. Because the p-value is significant, we reject the null hypothesis and conclude that there is strong statistical evidence that at least one of the predictor variables is associated with life expectancy. The R-squared of 0.8367 indicates that the predictors included in the model together explain about 84% of the variance in life expectancy.
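The adjusted R-squared in the summary penalizes the plain R-squared for the number of predictors. As a short sketch, it can be recomputed from the reported values; the sample size is inferred here as the 155 residual degrees of freedom plus the 9 estimated coefficients.

```r
# Values taken from the summary(regression_model) output above.
r_squared <- 0.8367
n_predictors <- 8
n_obs <- 155 + n_predictors + 1   # residual df + slopes + intercept

# Adjusted R-squared penalizes R-squared for model size.
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)
round(adj_r_squared, 4)
```

The small gap between the two values suggests the fit is not being inflated much by the number of predictors.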
Using p-value as an evaluation metric, most of the included variables are identified as significant contributors to the model: migration, educ_mean, and food_deficit all reported p-values below the 0.01 significance threshold. Fertility and tb were even stronger contributors, reporting p-values below the 0.001 threshold, making them the most significant contributors in the variable set. Of the significant variables, fertility and tb both have negative coefficients, indicating a negative relationship with life expectancy. Conversely, migration, educ_mean, and food_deficit all have positive coefficients, indicating that they are positively associated with life expectancy.
Interestingly, fertility is the most significant factor in life expectancy based on the multiple regression model. A possible explanation for the observed negative association is that families in less developed countries have a higher number of children to support lifestyles that are agriculture and labor oriented, and those countries also have other factors that reduce life expectancy, such as less developed healthcare infrastructure.
The explanation for tuberculosis incidence as the second most important variable is likely more straightforward: tuberculosis probably has a direct negative effect on the overall health of a population, especially in developing countries with less developed healthcare systems. Migration, mean years of education, and food deficit are probably related to life expectancy because they are all possible proxy variables for a country's economic status. More economically successful countries will tend to have more incoming immigration, a more educated population, and a lower food deficit.
Economically successful countries are able to spend more money on the healthcare of their citizens, improving life expectancy. Food deficit also likely has a direct impact on life expectancy, in that malnourished individuals are likely to experience a general reduction in health. Note, however, that the estimated food_deficit coefficient is positive, which is the opposite of what this reasoning predicts; this may reflect correlation with other predictors in the model and warrants further investigation. Overall, the multiple regression analysis identified a set of strong predictor variables that explain variation in life expectancy across countries with reasonable accuracy. The model’s performance may improve with the inclusion of additional high-quality variables.